2 Python essentials - files and pandas

Now that you have reviewed the basic functionality, let’s look at how to read/write files, and how to use some of the most common packages (e.g., pandas, matplotlib, …).

Reading and writing files

Reading and writing files is an important aspect of programming. Let’s imagine you have the following list of molecules:

molecules = ['Amigdalin', 'Fenfuram', 'Estradiol', '2-Methylbutanol']

and you would like to save it in a text file, one name per line.

To do so, you could run the following code:

molecules = ['Amigdalin', 'Fenfuram', 'Estradiol', '2-Methylbutanol']

with open('molecules.txt', 'w') as file:
    file.write('\n'.join(molecules))

Here, 'molecules.txt' is the name of the file you want to write to. The 'w' in 'w' is for writing to the file. If the file does not exist, it will be created. If the file exists, its contents will be overwritten.

The write() method is used to write to the file. You pass a string to the write() method, and it writes that string to the file.

The with statement is used to make sure that the file is closed properly after you are done writing to it.

The string that is written to the file is '\n'.join(molecules), where '\n' is a line break and the .join(molecules) statement joins all the elements in the molecules list with '\n'. We can print it below:

print('\n'.join(molecules))

If we want to read the molecules.txt file that we have created and get back the list of molecules. We use a similar synthax.

with open('molecules.txt', 'r') as file:
    molecules = [molecule for molecule in file.readlines()]
print(molecules)

Here, 'molecules.txt' is the name of the file you want to read. The r in r is for reading the file (as opposed to writing to it).

The open() function returns a file object that you can use to read the file. The readlines() method is used to read the contents of the file, line by line. The lines are returned as a string.

However, as you can see, it still contains the line break character \n at the end of each line.

Hence, we apply the .strip() method to remove whitespace in front and at the end of the string.

with open('molecules.txt', 'r') as file:
    molecules = [molecule.strip() for molecule in file.readlines()]
print(molecules)

Pandas

Pandas is a popular library for data analysis and manipulation in Python. It provides a way to store and manipulate data in a tabular form, similar to a spreadsheet.

Here’s how to use pandas for reading and writing files:

import pandas as pd

# Reading a CSV file
df = pd.read_csv('file.csv')
print(df)

# Reading an Excel file
df = pd.read_excel('file.xlsx')
print(df)

# Writing a CSV file
df.to_csv('file.csv', index=False)

# Writing an Excel file
df.to_excel('file.xlsx', index=False)

Here, 'file.csv' and 'file.xlsx' are the names of the files you want to read or write. You will need to replace them with the actual names of the files you are working with.

The pd.read_csv() function is used to read a CSV file, and the pd.read_excel() function is used to read an Excel file. The functions return a pandas DataFrame (this is what df stands for), which is a two-dimensional labeled data structure with columns of potentially different types.

The to_csv() and to_excel() methods are used to write a DataFrame to a CSV file and an Excel file, respectively.

The index argument is used to specify whether or not to write the index of the DataFrame to the file. If index is set to False, the index will not be written.

Because pandas is not a standard Python library, we have to import the module. This the reason for the first line import pandas as pd.

If you want to go more in-depth, there is also a great pandas tutorial on Kaggle Learn.

ESOL dataset

Let’s download a dataset containing molecules. We will use the Estimated SOLubility (ESOL) dataset by Delaney, as preprocessed by DeepChem.

The Delaney dataset is a collection of small organic molecules with experimental solubility values in water. It is often used as a benchmark dataset for testing and evaluating the performance of machine learning models in predicting the solubility of molecules. The dataset contains 1,084 molecules and a variety of molecular properties, including the solubility value, molecular weight, and atom-level information such as the number of atoms and types of bonds. We will download a .csv file.

!wget "https://raw.githubusercontent.com/schwallergroup/ai4chem_course/main/notebooks/01%20-%20Basics/data/delaney-processed.csv"

The delaney-processed.csv file was successfully downloaded.

Here, we used the wget command with a ! in front. In Jupyter notebooks, the ! symbol is used to run shell/terminal commands directly from the notebook.

You can use it to check the version of Python that you’re using:

!python --version

This feature is useful when you want to run shell commands directly from the notebook, without having to switch to a terminal or command prompt.

It’s important to note that the ! symbol only works in Jupyter notebooks, and not in regular Python scripts.

Exercise 01b_01

To practice this let’s use some more useful shell commands.

ls - lists files in folder
head file - show the first lines of the file
mv file target_location - move file to target location
mkdir folder_name make a new folder

So, your use those commands to: 1. See if the delaney-processed.csv is in the current folder (ls) 2. Make a new folder called data (mkdir) 3. Move the delaney-processed.csv file to the data folder (mv)

# write your code here, don't forget the `!` for shell commands



# if you want to see the solution, uncomment the following line
# %load https://raw.githubusercontent.com/schwallergroup/ai4chem_course/main/notebooks/solutions/solution_01b_01.txt

Once you have correctly moved delaney-processed.csv to the data folder. You should be able to look at the first lines of the file with the following command.

!head data/delaney-processed.csv

Handling the ESOL dataset with pandas

Exercise 01b_02

You see above that the .csv file contains comma-separated values. So, let’s use pandas to:

read the file into a DataFrame (use theread_csv function and assign it to the df variable)
show the first 5 rows using df.head()

# Let's assume, we have not yet imported pandas
# Start by importing the pandas module



# Read the ESOL dataset into a DataFrame



# Inspect the first 5 rows of the DataFrame

# For the solution, uncomment the following line:
# %load https://raw.githubusercontent.com/schwallergroup/ai4chem_course/main/notebooks/solutions/solution_01b_02.py

#On Google Colab you can run this command to make the dataframe interactive.
%load_ext google.colab.data_table

df

Exercise 01b_03

Let’s focus on a single colum for the moment.

You can access a specific column of the DataFrame by using square bracket notation and the name of the column, such as df["measured log solubility in mols per litre"].

You can also perform operations on a specific column of the DataFrame, such as calculating the mean value, by using methods such as mean().

Try this in the code cell below, and save the mean solubility in the variable mean_solubility.

# Access a the "measured log solubility in mols per litre" column


# Calculate mean statistics on that column

# For the solution, uncomment the following line:
# %load https://raw.githubusercontent.com/schwallergroup/ai4chem_course/main/notebooks/solutions/solution_01b_03.py

Once you have this you could the following example:

# Selecting rows based on conditions
high_solubility = df[df["measured log solubility in mols per litre"] > mean_solubility]
print(high_solubility)

# Adding a new column to the DataFrame
df["Solubility Class"] = "Low"
df.loc[df["measured log solubility in mols per litre"] > mean_solubility, "Solubility Class"] = "High"
print(df.head())

# Grouping data by a column
grouped = df.groupby("Solubility Class")
print(grouped.mean())

# Sorting the DataFrame
df.sort_values("measured log solubility in mols per litre", ascending=False, inplace=True)
print(df.head())

In this example, you can see how to select rows based on conditions using boolean indexing, add a new column to the DataFrame, group data by a column, sort the DataFrame, and write the DataFrame to a CSV file.

The df[df["measured log solubility in mols per litre"] > mean_solubility] line selects rows from the DataFrame where the Solubility column is greater than the mean solubility.

The df["Solubility Class"] = "Low" line adds a new column to the DataFrame, and the df.loc[df["measured log solubility in mols per litre"] > mean_solubility, "Solubility Class"] = "High" line sets the values in the new column based on conditions.

The grouped = df.groupby("Solubility Class") line groups the data by the Solubility Class column, and the grouped.mean() line calculates the mean value of each group.

The df.sort_values("Solubility", ascending=False, inplace=True) line sorts the DataFrame in descending order based on the Solubility column.

That’s a brief overview of some of the basic operations you can perform on the ESOL dataset using pandas.

Additional pandas functionality

You can do many more things with pandas as also described in the documentation. Most of the questions that you will have will certainly will already have been answered by someone on StackOverflow, and you could use tools like ChatGPT as a personal interactive tutor.

Here, just some additional examples how you could merge/concatenate DataFrames:

# Merging DataFrames
df1 = df[["Compound ID", "measured log solubility in mols per litre"]]
df2 = df[["Compound ID", "SMILES"]]
merged = pd.merge(df1, df2, on="Compound ID")
print(merged.head())

# Concatenating DataFrames
df3 = df[["Compound ID", "measured log solubility in mols per litre"]].head(10)
df4 = df[["Compound ID", "SMILES"]].tail(10)
concatenated = pd.concat([df3, df4])
print(concatenated)

Or handle missing values in your DataFrame (there are no missing values in the ESOL dataset).

# Handling missing values
df["Compound ID"].fillna("", inplace=True)
print(df.head())